Arrival Delay will be represented by random variable X
Departure Delay will be represented by random variable Y

1. First we look at the skew of both ArrDelay and DepDelay

The following histogram shows the distribution of arrival delays, which is positively skewed (long right tail)

library(ggplot2)
ggplot(hflights, aes(ArrDelay)) + geom_histogram() + xlim(-50, 250)

The following histogram shows the distribution of departure delays, which is also positively skewed

ggplot(hflights, aes(DepDelay)) + geom_histogram() + xlim(-50, 250)

2. Subsetting hflights and removing NA values from the data frame

h=subset(hflights,select=c(ArrDelay,DepDelay))
h1=na.omit(h)

x= h1$ArrDelay
y= h1$DepDelay

length(x)
## [1] 223874
length(y)
## [1] 223874
quantile(x)
##   0%  25%  50%  75% 100% 
##  -70   -8    0   11  978
quantile(y)
##   0%  25%  50%  75% 100% 
##  -33   -3    0    9  981

Based on the data above, we will use x = 11 (the 3rd quartile of X) and y = 0 (the 2nd quartile, i.e. the median, of Y)

Next we calculate the counts for each cell of the X/Y table

  1. X<=11, Y<=0
p1=nrow(subset(h1, ArrDelay <= 11 & DepDelay <= 0))
p1
## [1] 108141
  2. X<=11, Y>0
p2=nrow(subset(h1, ArrDelay <= 11 & DepDelay > 0))
p2
## [1] 61026
  3. X>11, Y<=0
p3=nrow(subset(h1, ArrDelay > 11 & DepDelay <= 0))
p3
## [1] 6159
  4. X>11, Y>0
p4=nrow(subset(h1, ArrDelay > 11 & DepDelay > 0))
p4
## [1] 48548

Table of counts

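# named count_table rather than table to avoid confusion with the base function table()
count_table = matrix(c(p1,p3,p1+p3,p2,p4,p2+p4,p1+p2,p3+p4,p1+p2+p3+p4), nrow=3, ncol=3)
colnames(count_table) = c("<=Q2",">Q2","Total")
rownames(count_table) = c("<=Q3",">Q3","Total")
count_table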
##         <=Q2    >Q2  Total
## <=Q3  108141  61026 169167
## >Q3     6159  48548  54707
## Total 114300 109574 223874
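
The same kind of table can also be built in one step by cross-tabulating the two threshold conditions. The sketch below is only an illustration: `toy` is a small made-up data frame standing in for h1, and the cut points 11 and 0 are reused from the quartiles above.

```r
# Hypothetical toy data (not real hflights values)
toy <- data.frame(ArrDelay = c(-5, 20, 3, 40, 11, -12, 15, 0),
                  DepDelay = c(-2, 10, 4, 25, 0, -1, -3, 1))
x_cut <- 11   # 3rd quartile of X in the analysis above
y_cut <- 0    # 2nd quartile of Y in the analysis above

# Explicit factor levels keep the row and column order fixed
arr_grp <- factor(ifelse(toy$ArrDelay > x_cut, ">Q3", "<=Q3"),
                  levels = c("<=Q3", ">Q3"))
dep_grp <- factor(ifelse(toy$DepDelay > y_cut, ">Q2", "<=Q2"),
                  levels = c("<=Q2", ">Q2"))

tab <- table(ArrDelay = arr_grp, DepDelay = dep_grp)  # 2x2 cell counts
tab_total <- addmargins(tab)                          # adds "Sum" row and column
tab_total
```

With the real data, replacing `toy` by h1 should reproduce the four counts p1 to p4 without separate subset() calls.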

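Calculating a. P(X>x | Y>y) = P(X>x, Y>y) / P(Y>y), approximating P(Y>y) by 0.5 because y is the median of Y (the exact empirical share is 109574/223874)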

a=(48548/223874)/.5
a
## [1] 0.4337082

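Calculating b. P(X>x, Y>y), the joint probability, read directly from the >Q3, >Q2 cell of the table (multiplying the two marginals would assume independence)

b=48548/223874
b
## [1] 0.2168541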

Calculating c. P(X<=x | Y>y), using the X<=11, Y>0 cell of the table and again approximating P(Y>y) by 0.5

c =(61026/223874)/.5
c
## [1] 0.5451817

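Checking independence: does P(A|B) = P(A), or equivalently P(A and B) = P(A)P(B)? Here A is the event X>x and B is the event Y>y.

A = 54707/223874
B = 109574/223874

Calculate P(A|B)

(48548/223874)/.5
## [1] 0.4337082

Calculate P(A)*P(B)

A*B
## [1] 0.1196033

P(A|B) is about 0.434 while P(A) is about 0.244, and the joint probability P(A and B) = 48548/223874 (about 0.217) differs from P(A)P(B) (about 0.120). Based on the above calculations we can conclude that the two events are not independent.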
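
As a cross-check, the joint, marginal and conditional probabilities can all be derived from the 2x2 counts with prop.table(). This is a sketch that re-enters the four counts reported above by hand; the object names are illustrative, and it uses the exact column total for P(B) instead of the 0.5 approximation.

```r
# Cell counts copied from the table of counts above
counts <- matrix(c(108141, 6159, 61026, 48548), nrow = 2,
                 dimnames = list(X = c("<=Q3", ">Q3"), Y = c("<=Q2", ">Q2")))

joint <- prop.table(counts)                   # each cell divided by the grand total
cond_on_y <- prop.table(counts, margin = 2)   # each column sums to 1: P(X | Y)

p_a  <- sum(joint[">Q3", ])      # P(A) = P(X > x)
p_b  <- sum(joint[, ">Q2"])      # P(B) = P(Y > y)
p_ab <- joint[">Q3", ">Q2"]      # P(A and B)
p_a_given_b <- cond_on_y[">Q3", ">Q2"]   # P(A | B)

# Under independence the first two entries would match, as would the last two
c(p_a_given_b = p_a_given_b, p_a = p_a, p_ab = p_ab, p_a_times_p_b = p_a * p_b)
```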

t1=c(x,y)
t2=table(t1)
chisq.test(t2)
## 
##  Chi-squared test for given probabilities
## 
## data:  t2
## X-squared = 5952700, df = 519, p-value < 2.2e-16

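Based on the above Chi-squared test we see that p<0.05. Note that, as the output header indicates, chisq.test applied to a one-way table of the pooled delay values is a goodness-of-fit test against equal probabilities; a test of independence between the two delays would instead be run on the 2x2 table of counts built earlier. The conclusion is nevertheless consistent with the probability comparison above, so we reject the hypothesis that Arrival Delay and Departure Delay are independent.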
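
A test of independence on the 2x2 table of counts could be sketched as follows. The four counts are re-entered by hand from the table above, and the object names are illustrative; for a 2x2 matrix chisq.test() applies Yates' continuity correction by default.

```r
# Cell counts copied from the table of counts above (without the Total margins)
counts <- matrix(c(108141, 6159, 61026, 48548), nrow = 2,
                 dimnames = list(ArrDelay = c("<=Q3", ">Q3"),
                                 DepDelay = c("<=Q2", ">Q2")))

res <- chisq.test(counts)   # Pearson's Chi-squared test of independence

res$parameter   # degrees of freedom: (2 - 1) * (2 - 1)
res$p.value     # small p-value means the independence hypothesis is rejected
res$expected    # counts expected if the two delays were independent
```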

2. Descriptive and Inferential Statistics.

Below is a Scatter Plot of Arrival Delay and Departure Delay

library(plotly)
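# current plotly versions expect column references as formulas (~column)
plot_ly(data = h1, x = ~ArrDelay, y = ~DepDelay, type = "scatter", mode = "markers")

The scatter plot suggests a strong positive correlation between the two delay times.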

Next we calculate the 95% confidence interval for the difference of the means (Welch two-sample t-test)

t.test(x,y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -26.106, df = 445800, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.494880 -2.146418
## sample estimates:
## mean of x mean of y 
##  7.094334  9.414983
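
Because each flight contributes both an arrival and a departure delay, the two samples are paired, not independent. As a hedged alternative to the Welch test above, a paired comparison could be sketched as follows; the vectors `arr` and `dep` are simulated stand-ins, not hflights data.

```r
set.seed(42)
# Hypothetical correlated delays: arrival tracks departure closely
dep <- rnorm(200, mean = 9, sd = 30)
arr <- dep - 2 + rnorm(200, sd = 10)

paired <- t.test(arr, dep, paired = TRUE)  # tests the mean of arr - dep
diffs  <- t.test(arr - dep)                # equivalent one-sample test
welch  <- t.test(arr, dep)                 # treats the samples as independent

paired$conf.int
welch$conf.int
```

When the two variables are strongly correlated, the paired interval is typically much narrower than the Welch interval, because the flight-to-flight variation cancels out in the differences.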

Derive a correlation matrix for two of the quantitative variables

h=subset(hflights,select=c(ArrDelay,DepDelay))
h1=na.omit(h)
cor(h1$ArrDelay, h1$DepDelay)
## [1] 0.9292181
corm = matrix(c(1,0.929,0.929,1),nrow=2,ncol=2)
corm
##       [,1]  [,2]
## [1,] 1.000 0.929
## [2,] 0.929 1.000
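
The matrix above was typed in by hand with the correlation rounded to three decimals. Calling cor() on a data frame returns the full correlation matrix directly, with row and column names and no rounding. A minimal sketch, using a made-up data frame `toy` in place of h1:

```r
# Hypothetical toy data (not real hflights values)
toy <- data.frame(ArrDelay = c(-5, 20, 3, 40, 11, -12, 15, 0),
                  DepDelay = c(-2, 10, 4, 25, 0, -1, -3, 1))

corm_toy <- cor(toy)   # 2x2 Pearson correlation matrix
corm_toy
```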

Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval

cor.test(h1$ArrDelay, h1$DepDelay, conf.level = 0.99)
## 
##  Pearson's product-moment correlation
## 
## data:  h1$ArrDelay and h1$DepDelay
## t = 1189.8, df = 223870, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.9284710 0.9299578
## sample estimates:
##       cor 
## 0.9292181

Based on the above we reject the null hypothesis that the correlation between the variables is 0.

3. Linear Algebra and Correlation

precm = solve(corm)
precm
##           [,1]      [,2]
## [1,]  7.301455 -6.783052
## [2,] -6.783052  7.301455
precm%*%corm
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
corm%*%precm
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

We multiply the precision matrix by the correlation matrix and the correlation matrix by the precision matrix. Both products give the identity matrix. This is expected: matrix multiplication is generally not commutative, so the order normally matters, but here the two matrices are inverses of each other, so either order yields the 2x2 identity matrix.
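
For a 2x2 correlation matrix with off-diagonal value r, the inverse has a simple closed form: the matrix with 1 on the diagonal and -r off the diagonal, divided by 1 - r^2. The sketch below checks that formula against solve() for r = 0.929; the name `closed_form` is illustrative.

```r
r <- 0.929
corm <- matrix(c(1, r, r, 1), nrow = 2)

# Closed-form inverse of a 2x2 correlation matrix
closed_form <- matrix(c(1, -r, -r, 1), nrow = 2) / (1 - r^2)

all.equal(closed_form, solve(corm))   # agrees with the numerical inverse
round(closed_form, 6)
```

The factor 1 / (1 - r^2) explains why the entries of the precision matrix grow quickly as the correlation approaches 1.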

4. Calculus-Based Probability & Statistics

summary(h1$ArrDelay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -70.000  -8.000   0.000   7.094  11.000 978.000
summary(h1$DepDelay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -33.000  -3.000   0.000   9.415   9.000 981.000

The minimum values are -70 and -33, so we shift the arrival delays by 71 to make every value strictly positive, so that an exponential distribution can be fitted.

x=h1$ArrDelay+71
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   63.00   71.00   78.09   82.00 1049.00

Histogram of Shifted Data

plot_ly(x=x, type="histogram")

Histogram of Original Data

plot_ly(x=hflights$ArrDelay, type="histogram")

Next we load the MASS package and fit an exponential distribution to the shifted data

require(MASS)
l=fitdistr(x, "exponential")
l
##        rate    
##   1.280503e-02 
##  (2.706317e-05)
l$estimate
##       rate 
## 0.01280503
samples=rexp(1000, l$estimate )
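
For the exponential distribution the maximum-likelihood estimate of the rate is the reciprocal of the sample mean, which is consistent with the fitted rate of about 0.0128 and the shifted mean of 78.09 reported above. A small sketch on simulated data (the vector `z` is a made-up stand-in for the shifted delays):

```r
library(MASS)

set.seed(7)
z <- rexp(500, rate = 0.0128)   # hypothetical positive data

fit <- fitdistr(z, "exponential")
mle_rate <- 1 / mean(z)         # closed-form maximum-likelihood estimate

all.equal(unname(fit$estimate), mle_rate)
```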

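Below are the 5th and 95th percentiles estimated from the 1000 samples simulated from the fitted exponential distribution (an empirical approximation of the quantiles given by its cumulative distribution function, CDF)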

quantile(samples, probs=0.95)
##      95% 
## 257.8541
quantile(samples, probs=0.05)
##       5% 
## 3.279185
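
The percentiles above come from a random sample, so they change from run to run. The exact quantiles of the fitted distribution can be obtained by inverting the exponential CDF, F(q) = 1 - exp(-rate * q), which gives q = -log(1 - p) / rate; qexp() does the same. A sketch using the rate estimated above:

```r
rate <- 0.01280503          # rate estimated by fitdistr above
p <- c(0.05, 0.95)

q_theory <- qexp(p, rate = rate)    # built-in quantile function
q_manual <- -log(1 - p) / rate      # inverse of F(q) = 1 - exp(-rate * q)

all.equal(q_theory, q_manual)
q_theory
```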

Calculating a 95% interval from the empirical data, assuming normality, as mean(x) plus or minus 1.96*sd(x) (z=1.96 is the 0.975 quantile, since we are dealing with 2 tails)

mean(x)-1.96*sd(x)
## [1] 17.90564
mean(x)+1.96*sd(x)
## [1] 138.283

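Based on the above calculations, the 95% interval is 17.90564 to 138.283 on the shifted scale. Because it uses sd(x) and not the standard error sd(x)/sqrt(n), it describes where about 95% of individual observations would fall under normality; it is not a confidence interval for the mean.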
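
The difference between the two kinds of interval can be sketched as follows. The vector `z` is simulated, not the shifted hflights data, and the variable names are illustrative.

```r
set.seed(1)
z <- rexp(500, rate = 1 / 78)   # hypothetical data with mean near 78

n <- length(z)
m <- mean(z)
s <- sd(z)

# Range covering about 95% of observations if the data were normal
spread_interval <- c(m - 1.96 * s, m + 1.96 * s)

# 95% confidence interval for the mean, based on the standard error
mean_ci <- c(m - 1.96 * s / sqrt(n), m + 1.96 * s / sqrt(n))

spread_interval
mean_ci
```

The confidence interval for the mean is narrower by a factor of sqrt(n), so with a sample as large as the one used here it would be very tight around the sample mean.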

quantile(h1$ArrDelay, probs=0.95)
## 95% 
##  57
quantile(h1$ArrDelay, probs=0.05)
##  5% 
## -18

Above are the 5th and 95th percentiles of the original (unshifted) arrival delay data; adding the shift of 71 gives 53 and 128 on the shifted scale.